Motion-Appearance Co-Memory Networks for Video Question Answering
Authors
Abstract
Video Question Answering (QA) is an important task in understanding video temporal structure. We observe that video QA has three unique attributes compared with image QA: (1) it deals with long sequences of images that are richer in information both in quantity and in variety; (2) motion and appearance information are usually correlated and can provide useful attention cues to each other; (3) different questions require different numbers of frames to infer the answer. Based on these observations, we propose a motion-appearance co-memory network for video QA. Our network builds on concepts from the Dynamic Memory Network (DMN) and introduces new mechanisms for video QA. Specifically, there are three salient aspects: (1) a co-memory attention mechanism that utilizes cues from both motion and appearance to generate attention; (2) a temporal conv-deconv network that generates multi-level contextual facts; (3) a dynamic fact ensemble method that constructs temporal representations dynamically for different questions. We evaluate our method on the TGIF-QA dataset, and it significantly outperforms the state of the art on all four TGIF-QA tasks.
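The core idea of the co-memory attention mechanism, as described in the abstract, is that the motion stream supplies attention cues for the appearance stream and vice versa. The following is a minimal NumPy sketch of one such cross-attention update; the dot-product scoring, the shapes, and the function names are illustrative assumptions, not the paper's actual attention networks.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def co_memory_step(app_facts, mot_facts, app_mem, mot_mem):
    """One sketched co-memory attention step.

    app_facts, mot_facts: (T, d) appearance / motion facts per time step.
    app_mem, mot_mem: (d,) current appearance / motion memory states.
    Each memory is updated using attention cues from the *other* modality,
    mirroring the cross-guidance described in the abstract.
    """
    # Motion memory scores the appearance facts (cross-modal cue).
    app_attn = softmax(app_facts @ mot_mem)   # (T,)
    # Appearance memory scores the motion facts (cross-modal cue).
    mot_attn = softmax(mot_facts @ app_mem)   # (T,)
    # Each memory becomes the attention-weighted summary of its own facts.
    new_app_mem = app_attn @ app_facts        # (d,)
    new_mot_mem = mot_attn @ mot_facts        # (d,)
    return new_app_mem, new_mot_mem
```

In the full model this step would be iterated for several memory hops and the attention would be produced by learned networks rather than a raw dot product; the sketch only shows the cross-modal wiring.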
Similar References
DeepStory: Video Story QA by Deep Embedded Memory Networks
Question-answering (QA) on video contents is a significant challenge for achieving human-level intelligence as it involves both vision and language in real-world settings. Here we demonstrate the possibility of an AI agent performing video story QA by learning from a large amount of cartoon videos. We develop a video-story learning model, i.e. Deep Embedded Memory Networks (DEMN), to reconstruc...
Uncovering Temporal Context for Video Question and Answering
In this work, we introduce Video Question Answering in temporal domain to infer the past, describe the present and predict the future. We present an encoder-decoder approach using Recurrent Neural Networks to learn temporal structures of videos and introduce a dual-channel ranking loss to answer multiple-choice questions. We explore approaches for finer understanding of video content using ques...
Exploring Deep Learning Models for Machine Comprehension on SQuAD
This paper explores the use of multiple models in performing question answering tasks on the Stanford Question Answering Database. We first implement and share results of a baseline model using bidirectional long short-term memory (BiLSTM) encoding of question and context followed by a simple co-attention model [1]. We then report on the use of match-LSTM and Pointer Net which showed marked improv...
Dynamic Inference: Using Dynamic Memory Networks for Question Answering
Question Answering is an incredibly important task in Natural Language Processing (NLP), and we aim to experiment with models in Machine Comprehension to attempt to perform well on Question Answering tasks. We specifically work with the Facebook bAbI dataset, and aim to achieve strong results using Dynamic Memory Networks, which are known to have strong performance for this task. Dynamic Memor...
Video Question Answering via Hierarchical Spatio-Temporal Attention Networks
Open-ended video question answering is a challenging problem in visual information retrieval, which automatically generates the natural language answer from the referenced video content according to the question. However, existing visual question answering works focus only on static images, which may be ineffective for video question answering due to the lack of modeling of tem...